Textual Search in Graphics Stream of PDF

Author

C. V. Jawahar

Abstract

Digitized books and manuscripts in digital libraries are often stored as images or graphics. They are not searchable at the content level because OCR is unavailable or the scanned images are of poor quality. Portable Document Format (PDF) has emerged as the most popular document representation schema for wider access across platforms. When no textual (Unicode, ASCII) representation is available, scanned images are stored in the graphics stream of the PDF. In this paper, we propose a novel solution for searching textual data in the graphics stream of PDF files at the content level. The proposed solution is demonstrated by enhancing an open-source PDF viewer (Xpdf). Indian language support is also provided. Users can type a word in Roman (ITRANS), view it in a font, and search the textual and graphics streams of PDF documents simultaneously.

Similar resources

Towards High-Quality Text Stream Extraction from PDF. Technical Background to the ACL 2012 Contributed Task

Extracting textual content and document structure from PDF presents a surprisingly (depressingly, to some, in fact) difficult challenge, owing to the purely display-oriented design of the PDF document standard. While a variety of lower-level PDF extraction toolkits exist, none fully support the recovery of original text (in reading order) and relevant structural elements, even for so-called bor...

A Comparison of Tabular PDF Inversion Methods

The most common form of tabular inversion used in computer graphics is to compute the cumulative distribution table of a pdf and then search within it to transform points, using an O(log n) binary search. Besides the standard inversion method, however, several other discrete inversion algorithms exist that can perform the same transformation in O(1) time per point. In this paper, we examine the ...

Identifying and Ranking the Important Textual and Paratextual Elements in Fiction Retrieval

Purpose: The purpose of this study is to identify the textual and paratextual elements in retrieving fiction from the readers’ perspective in order to provide the most appropriate access points for the readers and to improve access to fiction based on the readers’ needs. Method: The current research is an applied study in terms of purpose, applying a mixed method that was conducted using the ...

XHTML and SVG: Publishing with concept

Electronic publishing with tools from the Extensible Markup Language (XML) family of technologies has been increasingly used since the first XML and Extensible Stylesheet Language Transformations (XSLT) specifications were published in 1998/1999 and supporting processing applications emerged. This paper describes ideas and solutions of how to migrate the existing electronic publishing proc...

Combining Visual and Textual Features for Information Extraction from Online Flyers

Information in visually rich formats such as PDF and HTML is often conveyed by a combination of textual and visual features. In particular, genres such as marketing flyers and info-graphics often augment textual information by its color, size, positioning, etc. As a result, traditional text-based approaches to information extraction (IE) could underperform. In this study, we present a supervise...

Publication date: 2006
